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Abstract 

We present economical iterative algorithms built on the Biconjugate 
^4-Orthonormalization Procedure for real unsymmetric and complex non- 
Hermitian systems. The principal characteristics of the developed solvers is 
that they are fast convergent and cheap in memory. We report on a large 
combination of numerical experiments to demonstrate that the proposed 
family of methods is highly competitive and often superior to other popular 
algorithms built upon the Arnoldi method and the biconjugate Lanczos 
procedures for unsymmetric linear sytems. 
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1 Introduction 



In this study we investigate variants of the Lanczos method for the iterative 
solution of real unsymmetric and/or complex non-Hermitian linear systems 

Ax = b, (1) 

with the motivation of obtaining smoother and, hopefully, faster convergence 
behavior in comparison with the BiCG method as well as its two evolving variants 
- the CGS method and one of the most popular methods in use today - the 
Biconjugate Gradient Stabilized (BiCGSTAB) method. 

Iterative methods for solving unsymmetric systems are commonly developed 
upon the Arnoldi or the Lanczos biconjugate algorithms. These procedures 
generate an orthonormal basis for the Krylov subspaces associated with A 
and an initial vector v, and require only matrix- vector products by A. The 
generation of the vector recurrence by Arnoldi produces Hessenberg matrices, 
while the unsymmetric Lanczos biconjugation produces tridiagonal matrices. The 
price to pay due long recurrences in Arnoldi is the increasing orthogonalization 
cost along the iterations. In this paper we develop economical iterative 
algorithms built on the Biconjugate A-Orthonormalization Procedure presented 
in Section [2] The method ideally builds up a pair of Biconjugate A-Orthonormal 
(or, briefly, A-biorthonormal) basis for the dual Krylov subspaces K m (A;vi) and 
A T K m (A T ; wi) in the real case (which is A H K m (A H ; wi) in the complex case). 
The projection matrix onto the corresponding Krylov subspace is tridiagonal, so 
that the generation of the vector recurrences is extremely cheap in memory. We 
provide the theoretical background for the developed algorithms and we discuss 
computational aspects. We show by numerical experiments that the Biconjugate 
yl-Orthonormalization Procedure may lead to highly competitive solvers that are 
often superior to other popular methods, e.g. CGS, BiCGSTAB, IDR(s). We 
apply these techniques to sparse and dense matrix problems, both in real and 
complex arithmetic, arising from realistic applications in different areas. This study 



integrates and extends the preliminary investigations reported in 12 , limited to 
the case of complex non-Hermitian systems. In this paper we consider a much 
larger combination of experiments with both real and complex matrices having 
size order twice as large. Finally, we complete our work with a case study with 
dense linear systems in electromagnetic scattering from large structures. 

The paper is structured as follows: in Section [2] we present the Biconjugate 
yl-Orthonormalization Procedure and its properties; in Section [3] we describe a 
general framework to derive linear solver from the proposed procedure, and we 
present the algorithmic development of two Krylov projection algorithms. Finally, 



in Section [6] we report on extensive numerical experiments for solving large sparse 
and/or dense linear systems, both real and complex. 



2 A general two-sided unsymmetric Lanczos 
biconjugate A-orthonormalization procedure 



Throughout this paper we denote the standard inner product of two real vectors 
m,d6K" as 

n 

(u, v) = U T V = UiVi. 

i=l 

Given two vectors v\ and W\ with euclidean inner product (u\,Avi) = 1, we 
define Lanczos-type vectors Vj, Wj and scalars 5j, (3j, j = 1, 2, . . . , m by the following 
recursions 



S j+1 v j+1 = Avj - pjVj-! - ajvj = s j+1 , (2) 
Pj+iWj+i = A T Wj — SjWj-i — oijWj = tj+x. (3) 



where the scalars are chosen as 



at, = (uj,A(Avj)) , S j+1 = \(u j+u Aij j+1 )\» , p j+1 



(u j+1 ,Av j+1 ) 



This choice of the scalars guarantees that the recursions generate sequences of 
biconjugate A-orthonormal vector (or briefly, A-biorthonormal vectors) Vi and Wi, 
according to the following definition 

Definition 1 Right and left Lanczos-type vectors Vj, j = 1,2, ... ,m and w^, i = 
1,2, ... ,m form a biconjugate A-orthonormal system in exact arithmetic, if and 
only if 



(ui, Avj) 



1 < i,j < m. 



Eqns. (|2j|3j) can be interpreted as a two-side Gram-Schmidt orthonormalization 
procedure where at step i we multiply vectors Vi and by A and A T , respectively, 
and we orthonormalize them against the most recently generated Lanczos-type 



pairs (vi, Wi) and w%-x)- The vectors ajUj, CKjWj are the complex biconjugate A- 
orthonormal projections of Avt and A T w $ onto the most recently computed vectors 
t>i and it)j; analogously, the vectors (3iVi-i, f^u^-i are the complex biconjugate A- 
orthonormalization projections of Av,- t and A T Wi onto the next computed vectors 
and The two sets of scalars satisfy the following relation 

A+i^i+i — sJ +1 Ati + i = (si+i,Ati+i) . 

The scalars /3j and 5j can be chosen with some freedom, provided the biconjugate 
^4-orthonormalization property holds. We sketch the complete procedure 
in Algorithm [T] 

Algorithm 1 The Lanczos A-biorthonormalization procedure 
1: Choose v\,uji, such that (ui,Av\) = 1 
2: Set fit = 8 X = 0, u = v = e R n 
3: for j — 1,2, ... do 

4: Oj = (Wj, A (AVj)) 

5: -Oj+i = yitij — ctjVj — (3jVj_i 

6: = A T Uj — OijUJj — SjUj-i 

7: = \(u j+1 ,Av j+ i)\* 

(uj j+1 , Av j+1 ) 



9: 



Oj+i 

10: u; i+a = -f- 

Pj+l 

11: end for 



Notice that there is a clear analogy with the standard unsymmetric biconjugate 
Lanczos recursions. The matrix A is not modified and is accessed only via matrix- 
vector products by A and A T . Similarly to the standard Lanczos procedure, the two 
most recently computed pairs of Lanczos-type vectors Vy. and Wk for k = i, i — 1 are 
needed at each step. These two vectors may be overwritten with the most recent 
updates. Therefore the memory storage is very limited compared to the Arnoldi 
method. The price to pay is some lack of robustness due to possible vanishing of 
the inner products. Observe that the above algorithm is possible to breakdown 
whenever 5j + i vanishes while Wj + i and Aiij +1 are not equal to G lR n appearing 
in line 7. In the interest of counteractions against such breakdowns, refer oneself 
to remedies such as so-called look-ahead strategies [9j[TTJ[l5j[l6] which can enhance 
stability while increasing cost modestly, or others for example (3l. But that is 
outside the scope of this paper and we shall not pursue that here; for more details, 



please refer to 18 and the references therein. In our experiments we never observed 



a breakdown of the algorithm, as we will see in Section [6j However, it is fair to 
mention that this problem may occurr. The following proposition states some 
useful properties of Algorithm [T] 

Proposition 1 If Algorithm^ proceeds m steps, then the right and left Lanczos- 



type vectors Vj,j = 1,2, ...,m and Wi,i = 
orthonormal system in exact arithmetic, i.e., 



1,2, 



m form a biconjugate A- 



(ui,Avj) = 5ij, 1 < i,j < m. 

Furthermore, denote by V m = [v i, v 2 , • • • , %] and W m = [wi,w 2 ,. 
n x m matrices and by T m the extended tridiagonal matrix of the form 



where 



T 



T 



w m ] the 
(4) 



T 

1 Tt 



Oil 

5 2 a 2 



3m— 1 Ctm—1 Pn 
0~m. CX r. 



whose entries are the coefficients generated during the algorithm 
implementation, and in which a±, . . . , a m , (3 2 , ■ ■ ■ , f3 m are complex while 5 2 , ■ ■ ■ , S m 
positive. Then with the Biconjugate A-Orthonormalization Procedure, the following 
four relations hold 



V m T m -\- 5 m ^ r iV m ^ie m 
A T W m = W m Tl + (5 m+l uj rn+l e T m 
WlAV m = I m 
WlA 2 V m = T m 



(5) 
(6) 
(7) 



Proof. See 12 . 



A characterization of Algorithm [T] in terms of projections into relevant Krylov 
subspaces is derived in the following 



Corollary 1 The biconjugate Lanczos A-orthonormalization method ideally builds 
up a pair of biconjugate A-orthonormal bases for the dual Krylov subspaces 
K m (A;vi) and A T K m (A T ;wi). The matrix T m is the projection of the matrix A 
onto the corresponding Krylov subspaces. 



The Lanczos A-biorthonormalization method 
for general linear systems 



From the recursions denned in Eqns (|2j)-([3j) and the characterization given 
by Corollary 0, we derive a Lanczos A-biorthonormalization method for general 
linear systems along the following lines. 

STEP 1 Run Algorithm [T] m steps for some user specified m <C n and generate 
Lanczos- type matrices V m = [vi,v 2 , ■ ■ ■ , v m ), W m = [wi, w 2 , ■ ■ ■ , w m ] and the 
tridiagonal projection matrix T m defined in Proposition [l] 

STEP 2 Compute the approximate solution x m that belongs to the Krylov 
subspace x + K m (A;vi) by imposing the residual be orthogonal to the 
constraint subspace L m = A T K m (A T ;wi) 

r m b Ax m _L L mj 
or equivalently in matrix formulation 

(A T W m f (b - Ax m ) =0. (9) 
Recall that the approximate solution has form 



(10) 



so that by simple substitution and computation with Eqns (|8 10 ) we obtain 
a tridiagonal system to solve for y m , 



T m y m = (3e 1 , (3 



l r o| 



STEP 3 Compute the new residual and if convergence is observed, terminate the 
computation. Otherwise, enlarge the Krylov subspace and repeat the process 
again. 

The whole iterative scheme is sketched in Algorithm |2j It solves not only the 
original linear system Ax = b but, implicitly, also a dual linear system A T x* = b* 
with A T . Analogously, we can derive the counterpart of Algorithm[2]for the solution 
of the corresponding dual system A T x* = b*, where the dual approximation x* m is 
sought from the affine subspace x* + K m (A T ,wi) of dimension m satisfying 

b*-A T x* m JL AK m (A, Vl ). 

Denote r$ = b* — A T xl and /3* = H^oL- ^ w i = and V\ is chosen properly 
such that (vi, Aw\) = 1, then the counterparts of p 11) are the following 



(AV m ) T (b* - A T x* m ) = (12) 
x* m = x* + W m y* m , (13) 

TmVm = P*ei (14) 

where V m , W m and T m are defined in Proposition [T] and y* m e W 1 is the coefficient 
vector of the dual linear combination. In Section [6] we will derive a formulation 
that does not require multiplications by A T . 



Algorithm 2 Two-sided Biconjugate A-Orthonormalization method. 
1: Compute r = b — Ax for some initial guess x and set (3 = \\r \\ 2 . 
2: Start with v± — ^ and choose co± such that (ui,Avi) = 1, (for example, 

3: Generate the Lanczos-type vectors V m = [vi, v 2 , ■ ■ ■ , v m ] and W m = 
[wi, w 2 , • • • , w m ) as well as the tridiagonal matrix T m defined in Proposition [l] 
by running Algorithm 1 m steps. 

4: Compute y m = T~ 1 (/3e x ) and x m = x + V m y m . 

The approximation x m and the dual approximation be updated 

respectively from x m -i and x* m _ x at each step. Assume the LU decomposition 
of the tridiagonal matrix T m is 

T — T TT 



substituting which into (10 



11) and (13 



14) gives respectively 



x + V m (L m U m ) 1 {/3e 1 ) 
x + VmU^L^ 1 (pet) 

xl + WmiUlLiy'^e,) 

xl + W m {L T m y l [UlY 1 G8* ei ) 

* i * 
•^0 *m Z mi 



V U~ l Z 



L m l {p ei ), and P* m = W m (Ll) \z* m 



where P m 

Algorithm in 18]-Chapter 6, we easily derive the relations 



Using the same argument as in the derivation of DIOM from IOM 



x j 

X, 



+ CmP 



1 i Smfmi 

* _|_ f* jf 
m—1 1 Smfmi 



where £ m and £^ are coefficients, p m and p* m are the corresponding mth column 
vectors in P m and P^ defined above, termed as the mth primary and dual direction 
vectors, respectively. 

Observe that the pairs of the primary and dual direction vectors form a 
biconjugate A 2 -orthonormal set, i.e., (p*,A 2 pj) = 6ij (1 < i,j < m), which follows 
clearly from 

{P* m ) T A 2 P m = (W m (LlY'Y A 2 V m U- 1 
= L^WlA^U' 1 

-L-'m ± m'- J rn 

— T~ l T TT TT~ l 

I- m 

with ([8]). In addition, the mth primary residual vector r m = b — Ax m and the 
mth dual residual vector r* m = b* — Ax* m can be expressed as 



c T 
'Om+l^mVmVm+li 

R T * 



(15) 
(16) 



by simple computation with (|5] |6"|l0||ll||13||14[ ). Eqns (15) and (16) together 
with ([7]) reveal that the primary and dual residual vectors satisfy the biconjugate 
A-orthogonal conditions, i.e. (r*,Arj) = for i ^ j. 

It is known that the constraint subspace in an oblique projection method 
is different from the search subspace and may be totally unrelated to it. The 
distinction is rather important and gives rise to different types of algorithms 



The biconjugate A-orthogonality between the primary and dual residual vectors 
and the biconjugate y4 2 -orthonormality between the primary and dual direction 
vectors reveal and suggest alternative choices for the constraint subspace. This 
idea helps to devise the BiCOR method described in the coming section. 



4 The BiCOR method 

Proceeding in a similar way like the Conjugate Gradient and its variants CG, 
CR, COCG, BiCGCR, COCR and BiCR, given an initial guess xq to the considered 
linear system Ax = b, we derive algorithms governed by the following coupled two- 
term recurrences 



r = b- Ax , p = r , (17) 
Xj+i = xj + OijPj, (18) 

r i+l = T 3 ~ a 3 A Vj, (19) 

p j+1 = r j+1 + Pj-pj, for j = 0, 1, . . . (20) 

where rj = b — Axj is the jth residual vector and pj is the jth search direction 
vector. Denoting L m the underlying constraints subspace, the parameters a iy fa 
can be determined by imposing the following orthogonality conditions: 

rj+i 1 L m and Ap j+1 _L L m . (21) 
For concerned real unsymmetric linear systems 

Ax = b, (22) 

an expanded choice for the constraint subspace is L m = A T K m (A T , rjj) , where 
is chosen to be equal to P(A)ro, with P(t) an arbitrary polynomial of certain 
degree with respect to the variable t and pi = r$. It should be noted that the 
optimal choice for the involved polynomial is in general not easily obtainable and 
requires some expertise and artifice. This aspect needs further research. When 
there is no ambiguity or other clarification, a specific default choice for L m with 



Atq is adopted in the numerical experiments against the other popular choice 



for 



L m with Tq 



ro, sec e.g. 22,23 . It is important to note that the scalars 



OLj, 



/3j(J — 0, 1, . . .) in the recurrences (18 20) are different from those produced by 
Algorithm [TJ The search direction vectors p'jS here are multiples of the primary 
direction vectors p'^s defined in Section [3j The coupled two-term recurrences for 
the (j + l)th shadow residual vector r* +1 and the associated (j + l)th shadow search 
direction p* +1 can be augmented by similar relations to (19 20) as follows: 



T 3+l 
Pj+1 



Q>jA T p 



V 



r i+i + PjP% for J = °> X > 



(23) 
(24) 



where cnj and (3j are the conjugate complex of a 3 - and (3j in (19 20), 
correspondingly. Then with a certain polynomial P(t) with respect to t, (21) 
explicitly reads 



r j+1 ± A T K m (A T , r*) and Ap j+1 ± A T K m (A T , r*) with r* = P(A)r (25) 
which can be reinterpreted from a practical point of view as 

• the residual vectors r,'s and the shadow residual vectors r*'s are biconjugate 
^4-orthogonal to each other, i.e., (r*, Ati) = (A T r*, r^) = 0, for i ^ j; 

• the search direction vectors p[s and the shadow search direction vectors p*'s 

, 2 



form an A 2 -biconjugate set, i.e., (j>*,A 2 pi) = (A T p*,Api) = i^(A T ) p*j,p, 
0, for i 7^ j. 

These facts are already stated in the latter part of Section [3j Therefore, 
we possess the conditions to determine the scalars <x,- and (3j by imposing the 
corresponding biorthogonality and biconjugacy conditions (25) into ( |19p0p3j|24[ ). 
We use extensively the algorithmic schemes introduced in [8] for descriptions of the 
present algorithms. 

Making the inner product of A T r* and r J+ i as defined by (19) yields 

(A T r*,r j+1 ) = {A T r*,r 3 - a 3 Ap 3 ) = 0, 
with the biconjugate A-orthogonality between r J+1 and r}, further resulting in 

(A T r*,r 3 ) 



[A T r*, Ap 3 ) 



where the denominator of the above right-hand side can be further modified as 

(A T r*, A Pj ) = (A T p* - ft^ATp^Apj) = (A T p*,A Pj ) , 
because p*_ ± and pj are y4 2 -biconjugate. Then 



a ; 



(ATp*,A Pj ) (ATp*,A Pj ) 



(26) 



Similarly, writing that as defined by (20) is v4 2 -biconjugate to p* yields 
(p*, A 2 p j+1 ) = (A T p*, Ap 3+1 ) = (A T p*, Ar j+1 + frApj) = 0, 



giving 



Pi 



[A T p*,Ar j+l ) 
(ATp*,A Pj ) 



-an 



A T p* } Ar J+1 ) 



r^Arj) 



with ay computed in ( 26 ) . 
Observe from (23) that 



-ajA T p* 



and therefore, 



Pi 



(-a J A T p*,Ar j+1 ) _ (r* +1 - r* , Ar j+1 ) _ (r* +1 ,Ar j+l ) 



(27) 



3 ( r h Ar j) ( r h Ar j) " { r h Ar i) 

because of the biconjugate A-orthogonality of r* and r J+1 . Putting these 
relations ( Jl7]|2^p3p4|26||27| ) together and taking the strategy of reducing the 
number of matrix-vector multiplications by introducing an auxiliary vector 
recurrence and changing variables, together lead to the BiCOR method. The 
pseudocode for the left preconditioned BiCOR method with a preconditioner M is 
given in the following Algorithm |3j 



Algorithm 3 Left preconditioned BiCOR method. 
1: Compute r = b — Axq for some initial guess Xq. 

2: Choose Tq = P(A)r such that (r^Aro) ^ 0, where P(t) is a polynomial in t. 

(For example, 7q = Ar ). 
3: for j — 1, 2, ... do 

4: Solve MZj-i = 7j_i 

5: if j = l then 

6: solve M t Zq = Tq 

7: end if 

8: Z — Azj-i 

9: Pj-l = {z* j . 1 ,z) 

10: if pj i = 0, method fails 

11: if j — 1 then 

12: p = ^0 

13: ^ = 4 

14: 50 = Z 

15: else 

16: pj- 2 = Pj-l/Pj-2 

17: = + (3j- 2 Pj-2 

18: p*_, = 4_! + &-2P;-2 

19: ^ = 5 + fij~2qj-2 

20: end if 

21: = A T p*_ i 

22: solve M T u*_ l = gf*_ 1 

23: aj_i = Pj-i/(uJ_i,fc--i) 
24: = + Q!j_iPj_i 

25: rj = rj_i - aj-i^-i 

26: = - ttj.iU^! 

27: check convergence; continue if necessary 

28: end for 



5 A transpose-free variant of the BiCOR 
method: the CORS method 



Exploiting similar ideas to the ingenious derivation of the CGS method 27 , 
one variant of the BiCOR method can be developed which does not require 
matrix-vector products by A T . The new algorithm is referred to as CORS and 
is derived using a different polynomial representation of the residual with the hope 
of increasing the effectiveness of BiCOR in certain circumstances. First, the CORS 
method follows exactly a similar way in [27] for the derivation of the CGS method 
while taking the strategy of reducing the number of matrix-vector multiplications 
by introducing auxiliary vector recurrences and changing variables. In Algorithm[3j 
by simple induction, the polynomial representations of the vectors r J; r*, pj, p* at 
step j can be expressed as follows 



Tj = 4>j(A)r , pj = 7Tj(A)r , 
r* = <f>M T y , p) = K 3 {A T y Ql 

where (pj and TTj are Lanczos-type polynomials of degree less than or equal to j 
satisfying (pj(0) = 1. Substituting these corresponding polynomial representations 
into fl26[27| ) gives 



a 3 



Pi 



(r*,A rj ) ( Vj {A T )r*,A Vj (A)r ) (r* , Atf (A) r > 



(ATp*, Ap 3 ) (A T 7Tj {A?) r* , An, (A) r > (r*, AH] (A) r > ' 
(r* +1 ,Ar j+1 ) _ (<p j+1 (A T )r*,A^ +1 (A)r ) _ (r*, Atf +l {A) r > 



(rlArj) " " (^(AT)r* ,A^(A)r ) (r* , Arf (A) r ) ' 

Also, note from ( l~9p0 ) that (pj and tcj can be expressed by the following 



recurrences 



<f> j+ l(t) = (j)j(t) - OijtTTjit), 
7T i+ i(t) = <f) j + 1 {t)+ PjTXjit). 



By some algebraic computation with the help of the induction relations 
between (pj and iTj and the strategy of reducing operations mentioned above, the 
desired CORS method can be obtained. The pseudocode for the resulting left 
preconditioned CORS method with a preconditioner M can be represented by the 



following scheme. In many cases, the CORS method is amazingly competitive 
with the BiCGSTAB method (see e.g. Section [6]). However, the CORS method, 
like the CGS, SCGS, and the CRS methods, is based on squaring the residual 
polynomial. In cases of irregular convergence, this may lead to substantial 
build-up of rounding errors and worse approximate solutions, or possibly even 
overflow (see e.g. example). For discussions on this effect and its consequences, 
see §[T§[20l[2T)[30]. 

Algorithm 4 Left preconditioned CORS method. 
1: Compute r = b — Axq for some initial guess %q. 

2: Choose Tq = P(A)ro such that (r^,Ar ) ^ 0, where P(t) is a polynomial in t. 





(For example, Tq = Atq). 


3 


for j ; = 1, 2, ... do 


4 


solve Mzj-i = rj-i 


5 


f = Azj-i 


G 


Pj-i = ( r o> r > 


7 


if Pj-x = 0, method fails 


8 


if j = 1 then 


9 


e = r 


10 


solve Mzeo = eo 


11 


d = f 


12 


q = f 


13 


else 


14 


Pj-2 =Pj-lfpj-2 


15 


e,-_i = r,-_i + /3j^ 2 hj-2 


1G 


zej-i = Zj— i + (3j-2zhj-2 


17 


dj-i = r + 0j- 2 fj-2 


18 


Qj-i = d j-i + Pj-2 (/j-2 + Pj-lOj-*) 


19 


end if 


20 


solve Mg = g,,_i 


21 


g = 


22 


= Pj-il KA) 


23 


hj-i = — a>j^iqj-i 


24 


zhj-i = zej-i — aj-iq 


25 


fj-i = d j-i ~ 


2G 


Xj = Xj-i + Oij-i (2zej-\ — atj-iq) 


27 


Tj = Tj_i — ctj-i (2dj_i — aj-iq) 


28 


check convergence; continue if necessary 


29 


end for 



6 Numerical experiments 



We initially illustrate the numerical behavior of the proposed algorithms on a 
set of sparse linear systems. Although the focus of the paper is on real unsymmetric 
problems, we also report on experiments on complex systems. The Lanczos 
biconjugate A-orthonormalization procedure is straightforward to generalize to 
complex matrices, see e.g. [12]. We select matrix problems of different size, 
from small (a few tens thousand unknowns) to large (more than one million 
unknowns), and arising from various application areas. The test problems are 
extracted from the University of Florida [2] matrix collection, except the two 
SOMMEL problems that are made available at Delft University. The characteristics 
of the model problems are illustrated in Tables ( 
are shown in Tables [3} The experiments are carried out using double precision 
floating point arithmetic in MATLAB 7.7.0 on a PC equipped with a Intel(R) 
Core(TM)2 Duo CPU P8700 running at 2.53GHz, and with 4 GB of RAM. We 
report on number of iterations (referred to as Iters), CPU consuming time in 
seconds (referred to as CPU), log 10 of the final true relative residual 2-norms 
defined log 10 (||6 — Ax n \ | 2 /| \r \ | 2 ) (referred to as TRR). The stopping criterium 
used here is that the 2-norm of the residual be reduced by a factor TOL of the 
2-norm of the initial residual, i.e., | |r„| | 2 /| |?"o| b < TOL, or when Iters exceeded the 
maximal number of matrix-by- vector products MAXMV. We take MAXMV very 
large (10000) and TOL = 10~ 8 . All these tests are started with a zero initial guess. 
The physical right-hand side is not always available for all problems. Therefore we 
set b = Ae, where e is the n x 1 vector whose elements are all equal to unity, such 
that x = (1, 1, , 1) T is the exact solution. It is stressed that the specific default 
choice for the constraint space L m with Tq = Ar^ is adopted in the implementation 
of the Lanczos biconjugate A-ort honor malizat ion methods. Finally, a symbol '-' is 
used to indicate that the method did not meet the required TOL before MAXMV . 

We observe that the proposed algorithms enable to solve fairly large problems 
in a moderate number of iterations. The iteration count is always much smaller 
than the problem dimension, see e.g. experiments on the large ATMOSMODJ, 
ATMOSMODL, KIM2 problems. The final approximate solution is generally very 
accurate. In all our experiments we did not observe breakdowns in the Lanczos 
A-biortho normalization procedure that proves to be remarkably robust. The setup 
of the methods does not require parameters; however, there is some freedom in the 
selection of the initial shadow residual r$ . We analyse in Table [4] the effect of a 
different choice for the initial shadow residual; performance may slightly change 
but the convergence trend is preserved and in general it is not possible to predict 
a priori the effect. The results indicate that CORS is significantly more robust 



T|2), and the numerical results 



and faster than BiCOR. The proposed family of solvers exhibits fast convergence, 
is parameter-free, is extremely cheap in memory as it is derived from short-term 
vector recurrences and does not suffer from the restriction to require a symmetric 
preconditioner when it is applied to symmetric systems. Therefore, it can be a 
suitable computational tool to use in applications. 

In Tables ([5 18) we illustrate some comparative figures with other popular 
Krylov solvers, that are developed on either the Arnoldi and the Lanczos 
biconjugation methods. The only intention of these experiments is to show the 
level of competiteveness of the presented Lanczos A-biorthonormalization method 
with other popular approaches for linear systems. We consider GMRES(50), Bi- 
CG,QMR, CGS, Bi-CGSTAB, IDR(4), BiCOR, CORS, BiCR. Indeed some of these 
algorithms depend on parameters and we did not search the optimal value on our 
problems. We set the value of restart of GMRES equal to 50. The memory request 
for GMRES is the matrix+(restart-|-5)?7,, so that it remains much larger than the 



memory needed for CORS and BiCOR (see Table 19 ). The parameter value for IDR 
is selected to 4, that is often used in experiments. The CORS method proves very 
fast in comparison, thanks to its low algorithmic cost. We observe the robustness 
of the two algorithms on the STOMMEL1 problem from Ocean modelling. On this 
problem, CORS, BICOR and IDR are equally efficient, IDR being slightly faster; 
however, CORS and BICOR are remarkably accurate. 



Matrix problem 


Size 


Field 


Characteristics 


WATER.TANK 

XENON2 

STOMACH 

TORS03 

LANGUAGE 

MAJORBASIS 

ATMOSMODJ 

ATMOSMODL 


60,740 
157,464 
213,360 
259,156 
399,130 
160,000 
1,270,432 
1,489,752 


3D fluid flow 

Materials 

Electrophysiology 

Electrophysiology 

Natural language processing 

Optimization 

Atmospheric modelling 

Atmospheric modelling 


real unsymmetric 
real unsymmetric 
real unsymmetric 
real unsymmetric 
real unsymmetric 
real unsymmetric 
real unsymmetric 
real unsymmetric 



Table 1: Set of test real matrix problems. 



6.1 dense problems 

In the last decade, iterative methods have become widespread also in dense 
matrix computation partly due to the development of efficient boundary element 
techniques for engineering and scientific applications. Boundary integral equations, 
i.e. integral equations defined on the boundary of the domain of interest, is one of 



Matrix problem 


Size 


Field 


Characteristics 


M4D2 .unseal 

WAVEGUIDE3D 

VFEM 

KIM2 


10,000 
21,036 
93,476 
456,976 


Quantum Mechanics 
Electromagnetics 
Electromagnetics 
Complex mesh 


complex unsymmetric 
complex unsymmetric 
complex unsymmetric 
complex unsymmetric 



Table 2: Set of test complex matrix problems. 



^ ^ Ver /Example 


BiCOR 


CORS 


WATER.TANK 


1711 / 38.66 / -8.01 


1290 / 28.6 / -8.1001 


XENON2 


1247 / 64.79 / -8.0366 


711 / 37.87 / -8.1368 


STOMACH 


82 / 4.33 / -8.1891 


40 / 2.36 / -8.1321 


TORS03 


97 / 6.92 / -8.12 


56 / 4.25 /-8.3181 


LANGUAGE 


34 / 2.54 / -9.4099 


24 / 2.11 / -8.3046 


MAJORBASIS 


60 / 2.22 / -8.0077 


35 / 1.45 / -8.688 


ATMOSMODJ 


416 / 120 / -8.1018 


286 / 110.52 / -8.0685 


ATMOSMODL 


273 / 106.36 / -8.2345 


192 / 87.69 / -8.0312 


M4D2 .unseal 


2324 / 8.38 / -8.0866 


- / 21.43 / 9.1144 


WAVEGUIDE3D 


3874 / 38.96 / -8.0508 


2988 / 38.41 / -8.0485 


VFEM 


4022 / 226.26 / -8.0156 


0.5 / 3.17 / 0.72363 


KIM2 


189 / 53.86 / -8.4318 


105 / 36.99 / -8.0199 



Table 3: Number of iterations, CPU time and log 10 of the true residual on all test 
examples solved with TOL = 10~ 8 using r* = Atq for the shadow residual. 



the largest source of truly dense linear systems in computational science (4|[5J . 
Recent efforts to implement efficiently these techniques on massively parallel 
computers have resulted in competitive application codes provably scalable to 
several million discretization points, e.g. the FISC code developed at University of 
Illinois (24]^26), the INRIA/EADS integral equation code AS_ELFIP (28j[29 
Bilkent University code [6l 



the 

|7J, urging the quest of robust iterative algorithms in 
this area. Integral formulations of surface scattering and hybrid surface/volume 
formulations yield non-Hermitian linear systems that cannot be solved using 
the Conjugate Gradient (CG) algorithm. The GMRES method [19] is broadly 
employed due to its robustness and smooth convergence. In particular, its flexible 
variant FGMRES 



17 



has become also very popular in combination with inner- 
outer solution schemes, see e.g. experiments reported in (T, 13 for solving very 
large boundary element equations. Iterative methods based on the Lanczos 

BiCGSTAB 



biconjugation method, e.g. 



30 and QMR |10| are also considered 



/Example 


BiCOR 


CORS 


WATER.TANK 


1423 / 34.35 / -8.0208 


1246 / 30.36 / -8.1874 


XEN0N2 


1164 / 64.02 / -8.0014 


730 / 39.25 / -8.1178 


STOMACH 


81 / 4.37 / -8.1078 


50 / 2.96 / -8.0726 


TORS03 


87 / 6.19 / -8.2355 


56 / 4.57 / -8.1768 


LANGUAGE 


31 / 2.41 / -8.8309 


23 / 2.09 / -8.2032 


MAJORBASIS 


56 / 2.36 / -8.0939 


35 / 1.43 / -8.7153 


ATMOSMODJ 


393 / 99.44 / -8.0175 


305 / 92.53 / -8.1162 


ATMOSMODL 


279 / 83.86 / -8.0279 


213 / 75.95 / -8.1906 


M4D2 .unseal 


2370 / 8.58 / -8.0888 


5000 / 21.15 / 11.8045 


WAVEGUIDE3D 


3691 / 37.49 / -8.0015 


3009 / 35.66 / -8.0041 


VFEM 


3902 / 171.04 / -8.0013 


4012 / 232.72 / -8.1491 


KIM2 


189 / 51.9 / -8.3155 


105 / 36.89 / -8.0861 



Table 4: Number of iterations, CPU time and log 10 of the true residual on all test 
examples solved with TOL = 10~ 8 using r* = r for the shadow residual. 



Method 


Iters 


CPU 


TRR 


GMRES(50) 


9 


0.39 


-2.4224 


Bi-CG 


2145 


72.17 


-8.0419 


QMR 


2074 


78.56 


-8.0006 


CGS 


2162 


65.8 


-8.1078 


Bi-CGSTAB 


1232 


50.52 


-8.0027 


IDR(4) 


4682 


66.11 


-8.0242 


BiCOR 


1711 


38.66 


-8.01 


CORS 


1290 


28.6 


-8.1001 


BiCR 


1423 


38.09 


-8.0208 



Method 


Iters 


CPU 


TRR 


GMRES(50) 


1 


0.3 


-0.91923 


Bi-CG 


1329 


104.32 


-8.0035 


QMR 


1165 


104.02 


-8.0014 


CGS 


968 


69.52 


-8.258 


Bi-CGSTAB 


866.5 


85.08 


-8.1773 


IDR(4) 


2148 


76.41 


-8.1822 


BiCOR 


1247 


64.79 


-8.0366 


CORS 


711 


37.87 


-8.1368 


BiCR 


1164 


74.48 


-8.0014 



Table 5: Comparison results of Table 6: Comparison results of XENON2 
WATER.TANK with TOL = 10~ 8 . with TOL = 10~ 8 . 



Method 


Iters 


CPU 


TRR 


r T MRF, c if i in , i 

VjlvllVll/kJ I o\j I 


7^ 

1 O 


90 


_8 4944 


Bi-CG 


80 


6.52 


-8.1185 


QMR 


82 


7.75 


-8.2002 


CGS 


50 


3.69 


-8.0306 


Bi-CGSTAB 


149.5 


16.1 


-8.0366 


IDR(4) 


107 


5.38 


-8.3001 


BiCOR 


82 


4.33 


-8.1891 


CORS 


40 


2.36 


-8.1321 


BiCR 


81 


5.61 


-8.1078 



Method 


Iters 


CPU 


TRR 


vjlvll VJ_-/kJ I UvJ ) 


87 


9Q Q4 


-8 9fi77 


Bi-CG 


84 


8.83 


-8.0773 


QMR 


91 


11.12 


-8.0037 


CGS 


78 


7.55 


-8.1121 


Bi-CGSTAB 


105.5 


14.08 


-8.0388 


IDR(4) 


113 


6.66 


-8.0402 


BiCOR 


97 


6.92 


-8.12 


CORS 


56 


4.25 


-8.3181 


BiCR 


87 


7.82 


-8.2355 



Table 7: Comparison results 
STOMACH with TOL = 10~ 8 . 



of 



Table 8: Comparison results of TORS03 
with TOL = 10~ 8 . 



Method 


Iters 


CPU 


TRR 




Method 


Iters 


CPU 


TRR 


GMRES(50) 


29 


8.91 


-8.112 




GMRES(50) 


41 


6.61 


-6.9214 


Bi-CG 


30 


3.43 


-8.1769 




Bi-CG 


59 


3.3 


-8.2861 


QMR 


30 


4.24 


-8.1875 




QMR 


59 


3.99 


-8.1854 


CGS 


25 


2.66 


-10.6201 




CGS 


34 


1.77 


-8.2714 


Bi-CGSTAB 


24 


3.58 


-8.0851 




Bi-CGSTAB 


28.5 


2.06 


-8.0297 


IDR(4) 


38 


2.76 


-8.8881 




IDR(4) 


58 


2.08 


-8.2353 


BiCOR 


34 


2.54 


-9.4099 




BiCOR 


60 


2.22 


-8.0077 


CORS 


24 


2.11 


-8.3046 




CORS 


35 


1.45 


-8.688 


BiCR 


31 


3.3 


-8.8309 




BiCR 


56 


2.73 


-8.0939 



Table 9: Comparison results of 
LANGUAGE with TOL = 10~ 8 . 



Table 10: Comparison results 
MAJORBASIS with TOL = 10~ 8 . 



of 



Method 


Iters 


CPU 


TRR 


GMRES(50) 


175 


336.46 


-3.02 


Bi-CG 


432 


185.34 


-8.0037 


QMR 


432 


210.6 


-8.0712 


CGS 


294 


132.08 


-8.4498 


Bi-CGSTAB 


266 


143.87 


-8.4199 


IDR(4) 


540 


135.58 


-8.2371 


BiCOR 


416 


120 


-8.1018 


CORS 


286 


110.52 


-8.0685 


BiCR 


393 


166.91 


-8.0175 



Method 


Iters 


CPU 


TRR 


GMRES(50) 


71 


123.98 


-2.5268 


Bi-CG 


296 


141 


-8.4055 


QMR 


284 


168.65 


-8.0294 


CGS 


240 


106.3 


-8.0073 


Bi-CGSTAB 


170 


105.96 


-8.0032 


IDR(4) 


298 


114.27 


-8.1764 


BiCOR 


273 


106.36 


-8.2345 


CORS 


192 


87.69 


-8.0312 


BiCR 


279 


120.81 


-8.0279 



Table 11: Comparison results of 
ATMOSMODJ with TOL = 10~ 8 . 



Table 12: Comparison results of 
ATMOSMODL with TOL = 10" 8 . 



Method 


Iters 


CPU 


TRR 


VJlvll VJ_-/kJ I O \J I 


OOOvJ 


Qnq 41 

OUj.11 


-fl 861 ^8 

U.OUIOO 


Bi-CG 


3742 


43.09 


-8.0516 


QMR 


3479 


52.07 


-8.1005 


CGS 


1286 


54.39 


-0.53875 


Bi-CGSTAB 


2300 


33.26 


-8.2957 


IDR(4) 


2889 


17.86 


-6.4891 


BiCOR 


2397 


18.43 


-8.019 


CORS 


2155 


17.94 


-7.7396 


BiCR 


2583 


28.33 


-8.6085 



Table 13: Comparison results of 
STOMMEL1 with TOL = 10" 8 . 



Method 


Iters 


CPU 


TRR 


GMRES(50) 


3127 


44.69 


-8.0001 


Bi-CG 


2488 


14.17 


-8.0319 


QMR 


2497 


18.25 


-8.0226 


CGS 




28.74 


-0.10264 


Bi-CGSTAB 




39.46 


-4.7277 


IDR(4) 




12.14 


-8.0658 


BiCOR 


2324 


8.38 


-8.0866 


CORS 




21.43 


9.1144 


BiCR 


2370 


12.36 


-8.0888 



Table 15: Comparison results of 
M4D2_unscal with TOL = 10" 8 . 



Method 


Iters 


CPU 


TRR 


GMRES(50) 




2363.42 


-6.3922 


Bi-CG 


4744 


337.54 


-7.9757 


QMR 


3786 


307.21 


-8.0008 


CGS 


4709 


340.64 


-5.4999 


Bi-CGSTAB 


1941 


443.64 


-6.809 


IDR(4) 




360.89 


5.9272 


BiCOR 


3593 


154.87 


-8.0113 


CORS 


4022 


226.26 


-8.0156 


BiCR 


3902 


225.49 


-8.0013 



Method 


Iters 


CPU 


TRR 


OMRES^lTl 

Vjrlvll Vll/O I 0\J J 




78 fi4 


-9 41 74 


Bi-CG 


1515 


4.14 


-8.0953 


QMR 


1492 


5.54 


-8.0046 


CGS 


2520 


11.63 


-0.79492 


Bi-CGSTAB 


899.5 


3.2 


-8.4985 


IDR(4) 


1119 


1.61 


-8.3146 


BiCOR 


1302 


2.38 


-8.0514 


CORS 


936 


1.84 


-8.1765 


BiCR 


1223 


3.37 


-8.084 



Table 14: Comparison results of 
STOMMEL2 with TOL = 10" 8 . 



Method 


Iters 


CPU 


TRR 


GMRES(50) 




299.8 


-6.9859 


Bi-CG 


3851 


57.93 


-8.1015 


QMR 


3752 


68.24 


-8.0482 


CGS 


3013 


44.88 


-8.0216 


Bi-CGSTAB 


4957 


101.45 


-4.9052 


IDR(4) 




70.27 


-6.2982 


BiCOR 


3874 


38.96 


-8.0508 


CORS 


2988 


38.41 


-8.0485 


BiCR 


3691 


49.07 


-8.0015 



Table 16: Comparison results of 
WAVEGUIDE3D with TOL = 10~ 8 . 



Method 


Iters 


CPU 


TRR 


GMRES(50) 


40 


53.08 


-3.0378 


Bi-CG 


189 


79.99 


-8.0716 


QMR 


195 


94.95 


-8.5199 


CGS 


105 


43.42 


-8.1447 


Bi-CGSTAB 


133 


75.68 


-8.0007 


IDR(4) 


187 


46.44 


-8.067 


BiCOR 


189 


53.86 


-8.4318 


CORS 


105 


36.99 


-8.0199 


BiCR 


189 


66.64 


-8.3155 



Table 17: Comparison results of VFEM 
with TOL = 10~ 8 . 



Table 18: Comparison results of KIM2 
with TOL = 10~ 8 . 



Solver 


Type 


Products by A Products by A T 


Scalar products 


Memory 


BiCOR 


general 


1 1 


2 


matrix+lOn 


CORS 




2 


2 


matrix+14n 



Table 19: Algorithmic cost and memory expenses per iteration for Lanczos A- 
biorthonormalization methods for linear systems. 



in many studies but they may need many more iterations to converge especially on 
realistic geometries [1,14 



In this study we report on results of experiments with Lanczos A- 
biorthonormalization methods on four dense matrix problems arising from 
boundary integral equations in electromagnetic scattering from large structures. 
An accurate solution of scattering problems is a critical concern in civil aviation 
simulations and stealth technology industry, in the analysis of electromagnetic 
compatibility, in medical imaging, and other applications. The selected linear 
systems arise from reformulating the Maxwell's equations in the frequency domain 
as the following variational problem: 

Find the distribution of the surface current j induced by an incoming radiation, 
such that for all tangential test functions j* we have 



1 U - -<1 > ( JO) • ?{y) - ^ div rJ(>) • div r J*(y) ) dxdy 




r Jr 



kZ, 



E inc (x) ■ f(x)dx. (28) 



o Jr 



e ik\y-x\ 

We denote by G(\y — x\) = — . r the Green's function of Helmholtz 

Air\y — x\ 

equation, T the boundary of the object, k the wave number and Z = a/^oT^o 
the characteristic impedance of vacuum (e is the electric permittivity and [i the 



magnetic permeability). Boundary element discretizations of (28) over a mesh 
containing n edges produce dense complex non-Hermitian systems Ax = b. The set 
of unknowns are associated with the vectorial flux across an edge in the mesh, while 
the right-hand side varies with the frequency and the direction of the illuminating 
wave. When the number of unknowns n is related to the wavenumber, the iteration 
count of iterative solvers typically increase as O(n ' 5 ) 124). Eqn (28) is known as 



the Electric Field Integral Equation. Other possible integral models, e.g. the 
Combined Field Integral Equation or the Magnetic Field Integral Equation, only 
apply to closed surfaces and are easier to solve by iterative methods 14 . Therefore, 



we stick with Eqn. (28). We report the characteristics of the linear systems in 



Table 20 and we depict the corresponding geometries in Figure [T} In addition to 
BiCOR and CORS, we consider complex versions of BiCGSTAB, QMR, GMRES. 
Indeed these are the most popular Krylov methods in this area. The runs are done 
on one node of a Linux cluster equipped with a quad core Intel CPU at 2.8GHz and 
16 GB of physical RAM. using a Portland Group Fortran 90 compiler version 9. 

In Table 21 , we show the number of iterations required by Krylov methods to 
reduce the initial residual to O(10~ 5 ) starting from the zero vector. The right-hand 
side of the linear system is set up so that the initial solution is the vector of all 
ones. We carry out the M-V product at each iteration using dense linear algebra 
packages, i.e. the ZGEMV routine available in the LAPACK library and we do not 
use preconditioning. We observe again the remarkable effectiveness of the CORS 
method, that is the fastest non-Hermitian solver with respect to CPU time on most 
selected examples except GMRES with large restart. Indeed, unrestarted GMRES 
may outperform all other Krylov methods and should be used when memory is not a 
concern. However reorthogonalization costs may penalize the GMRES convergence 
in large-scale applications, so using high values of restart may not be convenient 
(or even not affordable for the memory) as shown in earlier studies [lj. In Table 21 
we select a value of 50 for the restart parameter. 

The good efficiency of CORS is even more evident on the two realistic aircraft 
problems i.e. Examples 3-4 which are very difficult to solve by iterative methods as 



no convergence is obtained without preconditioning in 3000 iterates. In Table 22 



we report on the number of iterations and on the CPU time to reduce the 
initial residual to O(l0~ 3 ). This tolerance may be considered accurate enough 
for engineering purposes. In [I] it has been shown that a level of accuracy of 
0(1O~ 3 ) may enable a correct reconstruction of the radar cross section of the 
object. Again, CORS is more efficient than restarted GMRES on these two tough 
problems. The BiCOR method also shows fast convergence. However, a nice feature 
of CORS over BiCOR is that it does not require matrix multiplications by A H on 
complex systems. This may represent an advantage when MLFMA is used because 
the Hermitian product often requires a specific algorithmic implementation [28] . 
In Figures [2] we illustrate the convergence history of CORS and GMRES(50) on 
Examples 2 to show the different numerical behavior of the two families of solvers. 
The residual reduction is much smoother for GMRES along the iterations. We 
observe that in our experiments, BiCGSTAB and unsymmetric QMR algorithms 
generally converge more slowly than BiCOR and CORS. 

The large spectrum of real and complex linear systems, in both sparse and dense 
matrix computation, reported in this study illustrate the favourable numerical 
properties of the proposed unsymmetric variant of the Lanczos procedure for linear 
systems. The results indicate that our computational techniques are capable to 
solve very efficiently a large variety of problems. Therefore, they can be a suitable 



Example Description Size Memory (Gb) Frequency (MHz) Ki(A) Geometry 



1 


Sphere 


12000 


4.6 


535 


6 


O(10 5 ) 


Fig. 


1(a) 




2 


Satellite 


1699 


0.1 


57 


1 


e>(io 5 ) 


Fig. 


1(b) 




3 


Jet prototype 


7924 


2.0 


615 


1 


O(10 7 ) 


Fig. 


1(c) 




4 


Airbus A3 18 prototype 


23676 


18.0 


800 


1 


O(10 7 ) 


Fig. 


1(d) 





Table 20: Characteristics of the model problems. 



^°^ Ver /Example 




1 




2 


CORS 


380 


(211) 


371 


(11) 


BiCOR 


441 


(251) 


431 


(15) 


GMRES(50) 


> 3000 


(> 844) 


871 


(17) 


QMR 


615 


(508) 


452 


(24) 


TFQMR 


399 


(435) 


373 


(27) 


BiCGSTAB 


764 


(418) 


566 


(18) 



Table 21: Number of iterations and CPU time (in seconds) required by Krylov 
methods to reduce the initial residual to 0(1CT 5 ). 



Example/ golver I CQRS GMRES(50) 

3 1286 (981) >3000 (>1147) 

4 924 (5493) 2792 (8645) 



Table 22: Number of iterations and CPU time (in seconds) for CORS and 
GMRES(50) to reduce the initial residual to (9(10~ 3 ) on the two aircraft 
problems 1(c) and 1(d) These problems do not converge in 3000 iterations. 



(a) Sphere 



(b) Satellite 




(c) Jet prototype (d) Airbus A3 18 prototype 

Figure 1: Meshes associated with test examples. 



numerical tool to use in applications. 
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Figure 2: Convergence histories for CORS and GMRES. 
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